Machine Meets Man: evaluating the psychological reality of corpus-based probabilistic models
Abstract
Linguistic convention allows speakers various options. Evidence is accumulating that these options are preferred in different contexts, yet the criteria governing the selection of the appropriate form are often far from obvious. Most researchers who attempt to discover the factors determining a preference rely on the linguistic analysis and statistical modeling of data extracted from large corpora. In this paper, we address the question of how to evaluate such models and explicitly compare the performance of a statistical model derived from a corpus with that of native speakers in selecting one of six Russian TRY verbs. Building on earlier work by Divjak (2003, 2004, 2010) and Divjak & Arppe (2013), we trained a polytomous logistic regression model to predict [1] verb choice given the context. We compare the predictions the model makes for 60 unseen sentences to the choices adult native speakers make in those same sentences. We then look in more detail at the interplay of the contextual properties and model computationally how individual differences in assessing the importance of contextual properties may impact the linguistic knowledge of native speakers. Finally, we compare the probability the model assigns to encountering each of the six verbs in the 60 test sentences to the acceptability ratings the adult native speakers give to those sentences. We discuss the implications of our findings for both usage-based theory and empirical linguistic methodology.

[1] Note that we use the word predict in the statistical sense, i.e., “identify as the most likely choice, given the data the model was trained on”.

Acknowledgments

The experiment received ethical approval from the University of Sheffield, School of Languages & Cultures; the data were collected in 2013. The financial support of the University of Sheffield in the form of a 2013 SURE summer research internship to Clare Gallagher is gratefully acknowledged; Clare set up the acceptability ratings study and recruited participants for this task.

Introduction

A particular idea can often be coded linguistically in several different ways: that is to say, linguistic convention allows speakers various options. At the lexical level, speakers can choose from sets of near synonyms (walk, march, stride, strut...). Similarly, at the grammatical level, there are often several options for encoding slightly different construals of the same situation: for instance, in English, there are several ways of marking past events (was walking, walked, had walked), two indirect object constructions (give him the book vs give the book to him), and so on. Cognitive linguists have long claimed that languages abhor (complete) synonymy, and evidence is accumulating that in the vast majority of cases the various options are preferred in different contexts. However, the criteria governing the selection of the appropriate form are often far from obvious, and hence there is now a considerable amount of empirical work attempting to describe the differences between near-synonymous lexemes or constructions (for book-length treatments see Arppe 2008, Divjak 2010, Klavan 2012 and references therein). Most researchers who attempt to discover the factors influencing a speaker's decision to use a particular form rely on the analysis of large corpora.
A typical analysis involves extracting a large number of examples from a corpus and coding them for a number of potentially relevant features (Klavan 2012), or even for as many potentially relevant features as possible (Arppe 2008, Divjak 2010). The usage patterns obtained can then be analyzed statistically to determine which of the candidate features are predictive of the form that is the focus of the study. The most rigorous studies also fit a statistical model to the data and test it on a new set of corpus examples (the testing set) to see how well it generalizes to new data.

One problem faced by researchers in this area is how to evaluate such models. A model that supplies the target form 85% of the time may be regarded as better than one that predicts it 80% of the time – but can this be regarded as adequate? After all, such a model still gets it wrong 15% of the time! The answer, of course, depends partly on (1) how many options there are to choose from (51% correct is very poor if there are only two options, but would be impressive if there were ten), but also on (2) the degree to which the phenomenon is predictable (100% correct is not a realistic target if the phenomenon is not fully predictable), as well as (3) what is being predicted: individual choices or rather proportions of choices over time. As Kilgarriff (2005) and many others have observed, language is never ever random; however, it is also rarely, if ever, fully predictable.
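To make point (1) concrete, a raw accuracy figure is best read against reference points such as the chance level (one divided by the number of options) and the majority baseline (always guessing the most frequent form). The following minimal sketch illustrates the comparison; the verb labels and data are invented and do not come from the study itself.

```python
# Minimal sketch with invented data: reading a model's accuracy against a
# chance baseline (1 / number of options) and a majority baseline (always
# guessing the most frequent form in the sample).
from collections import Counter

attested  = ["pytat'sja", "starat'sja", "pytat'sja", "probovat'", "pytat'sja", "silit'sja"]
predicted = ["pytat'sja", "pytat'sja",  "pytat'sja", "probovat'", "silit'sja", "silit'sja"]

accuracy = sum(p == a for p, a in zip(predicted, attested)) / len(attested)
chance = 1 / 6                                             # six candidate verbs
majority = Counter(attested).most_common(1)[0][1] / len(attested)

print(f"accuracy {accuracy:.2f} vs chance {chance:.2f} vs majority {majority:.2f}")
```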
The obvious solution for cognitive linguists is to compare the model's performance to that of native speakers of the language. Such a comparison could, in principle, result in three possible outcomes. First, the model may perform less well than humans. If this is the case, then the model is clearly missing something, and this tells us that we must go back to the data, find out what we have not coded for, add new predictors to the model, and test it again. Secondly, the model may perform as well as humans. This is clearly an encouraging outcome, but if we are interested in developing a psychologically realistic model (as opposed to simply describing the corpus data), we would want to make sure that the model is relying on the same criteria as the speakers. We could conclude that this was the case if the pattern of performance was similar, that is to say, if the model gives clear predictions (i.e., outputs a high probability for one particular option) when the speakers consistently choose the same option, and, conversely, if uncertainty in the model (several options with roughly equal predicted probabilities of, e.g., 0.2-0.3 in the case of 3-5 alternatives, as opposed to one clear favourite) corresponded to variability in human responses. Finally, the model may perform better than humans. Statistical models have been found to outperform human experts in a number of areas including medical diagnosis, academic and job performance, probation success, and likelihood of criminal behaviour (Dawes, Faust and Meehl 1989, Grove et al. 2000, Stanovich 2010). To our knowledge, no model of linguistic phenomena currently performs better than humans (for instance, is able to choose the form that actually occurred in a particular context in a corpus more accurately than the average human informant), but it is perfectly possible that, as our methods improve, such models will be developed.

1. Previous studies

There are now a number of published multivariate models that use data, extracted from corpora and annotated for a multitude of morphological, syntactic, semantic and pragmatic parameters, to predict the choice of one morpheme, lexeme or construction over another [2]. However, most of these studies are concerned with phenomena that involve binary choices (Gries 2003, De Sutter et al. 2008), and only a small number of these corpus-based studies have been cross-validated (Keller 2000, Sorace & Keller 2005, Wasow & Arnold 2003, Roland et al. 2006, Arppe & Järvikivi 2007, Divjak & Gries 2008) [3]. Of these cross-validated studies, few have directly evaluated the prediction accuracy of a complex, multivariate corpus-based model on humans using authentic corpus sentences (with the exception of Bresnan 2007, Bresnan & Ford 2010, Ford & Bresnan 2012, Ford & Bresnan 2013), and even fewer have attempted to evaluate the prediction accuracy of a polytomous corpus-based model in this way (but see Arppe & Abdulrahim 2013 for a first attempt). Below we will review the latter two types of cross-validated studies.

[2] There are a number of early studies that employ multiple explanatory variables but do not use these to construct multivariate models. Instead, they consider all possible unique variable-value combinations as distinct conditions (e.g. Gries 2002, Featherston 2005).

[3] Note that Grondelaers & Speelman (2007) and Kempen & Harbusch (2005) work the other way around and validate and refine experimental findings using corpus data.

Bresnan (2007) was the first to evaluate a multivariate corpus-based model (Bresnan et al. 2007) designed to predict the binary dative alternation. A scalar rating task was used to evaluate the correlation between the naturalness of the alternative syntactic paraphrases and the corpus probabilities. Materials consisted of authentic passages attested in a corpus of transcriptions of spoken dialogue; the passages were randomly sampled from the centers of five equally sized probability bins, ranging from a very low to a very high probability of having a prepositional dative construction. For each sampled observation the alternative paraphrase was constructed, and both options were presented as choices in the original dialogue context. Contexts were edited only for readability, by shortening and by removing disfluencies. Items were pseudo-randomized and construction choices were alternated to make up a questionnaire. Each of the 19 subjects received the same questionnaire, with the same order of items and construction choices. Subjects were asked to rate the naturalness of the alternatives in a given context by distributing 100 points over both options. Responses were analysed as a function of the original corpus model predictor variables using mixed effects logistic regression.
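As an illustration of the logic of this design (ours, not Bresnan's materials or analysis), the share of the 100 points allocated to one construction can be set against the corpus model's probability for that construction, item by item. A toy version with invented numbers:

```python
# Toy illustration with invented numbers: correlate the mean share of 100
# naturalness points given to one construction with the corpus model's
# probability for that construction across five items.
import numpy as np

corpus_prob = np.array([0.05, 0.25, 0.50, 0.75, 0.95])   # one value per item
points = np.array([[10, 20,  5],                          # items x raters
                   [30, 25, 40],
                   [55, 50, 60],
                   [70, 85, 75],
                   [95, 90, 85]])

mean_share = points.mean(axis=1) / 100                    # per-item rating share
r = np.corrcoef(corpus_prob, mean_share)[0, 1]
print(f"Pearson r = {r:.2f}")
```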
Bresnan found that subjects’ scores of the naturalness of the alternative syntactic paraphrases correlate well (R = 0.61) [4] with the corpus probabilities and can be explained as a function of the same predictors. Individual speakers’ choices matched the choice attested in the corpus in 63% to 87% of all cases (with a baseline of 57% correct by always choosing the most frequently occurring option). Bresnan concluded that language users’ implicit knowledge of the dative alternation in context reflects the usage probabilities of the construction.

[4] Arppe & Järvikivi (2007) criticize Bresnan’s set-up of operationalizing naturalness as a zero-sum game, with naturalness between the two alternatives always adding up to the same value, i.e. 100, as their own study shows that even strong differences in terms of preference might nevertheless exhibit relatively small differences in acceptability. However, Bresnan’s results would seem to indicate that the human participants were agreeing with the corpus-based estimates of the proportions of choice (in the long run) between the two alternatives (rather than with their naturalness). Of course, we cannot be sure what participants in an experiment are doing, regardless of how the instructions are formulated (cf. Penke & Rosenbach 2004).

Bresnan & Ford (2010) and Ford & Bresnan (2012, 2013) investigated the same question across American and Australian varieties of English. Relevant here is that they ran a continuous lexical decision task (Ford 1983) to check whether lexical-decision latencies during a reading task reflect the corpus probabilities. In a continuous lexical decision task subjects read a sentence word by word at their own pace and make a lexical decision as they read each word (participants are presented with a sentence one word at a time and must press a “yes” or “no” button depending on whether the “word” is a real word or a non-word). The participants were instructed to read the contextual passage first and then make a lexical decision for all words from a specific starting point; that starting point was always the word before the dative verb. There were 24 experimental items, chosen from the 30 corpus items used in the scalar rating task (Bresnan 2007). A mixed effects model fit to the data confirmed that lexical-decision latencies during a reading task reflect the corpus probabilities: more probable sentence types require fewer resources during reading, so that RTs measured in the task decrease for high-probability examples.

Arppe & Abdulrahim (2013) contrast corpus data and forced-choice data on four near-synonymous verbs meaning come in Modern Standard Arabic to assess the extent to which regularities extracted from a corpus overlap with the collective intuitions of native speakers. A model of the corpus data was built using polytomous logistic regression based on the one-vs-all heuristic (Arppe 2008, 2013a) and was compared to data from a forced-choice task completed by 30 literate Bahraini native speakers of Arabic, who read 50 sentences and chose the missing verb from a given list of verbs. The 50 experimental stimuli were chosen to represent the full breadth of contextual richness in the corpus data and the entire diversity of probability distributions, ranging from near-categorical preferences for one verb to approximately equal probability distributions for all four verbs. Arppe & Abdulrahim (2013) found that as the probability of a verb, given the context, rises, so does the proportion of selections of that verb in the context in question (the proportion being the relative number of participants selecting the particular verb). Importantly, there are hardly any cases where a low-probability verb received a high proportion of choices, and only a few in which a high-probability verb received a low proportion of choices.
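The one-vs-all heuristic behind this kind of polytomous model can be sketched as follows: fit one binary logistic regression per outcome ("this verb" vs. "any other verb") and rescale the resulting per-verb probabilities so that they sum to one. The sketch below is ours, with invented data, and uses scikit-learn rather than the R implementation used in the original studies (Arppe 2013a).

```python
# A minimal sketch (ours, not the authors' code) of polytomous logistic
# regression via the one-vs-all heuristic: one binary classifier per verb,
# with the per-verb probability estimates rescaled to sum to one. The data
# are invented stand-ins for contexts annotated with binary properties.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(300, 18)).astype(float)   # 18 contextual properties
verbs = rng.choice(["probovat'", "pytat'sja", "starat'sja"], size=300)

# One binary "verb vs. rest" model per outcome.
models = {v: LogisticRegression().fit(X, (verbs == v).astype(int))
          for v in np.unique(verbs)}

def predict_distribution(x):
    # P(verb | context) estimated verb by verb, then normalised.
    raw = {v: m.predict_proba(x.reshape(1, -1))[0, 1] for v, m in models.items()}
    total = sum(raw.values())
    return {v: p / total for v, p in raw.items()}

print(predict_distribution(X[0]))
```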
2. Russian verbs of trying

In this paper, we explicitly compare the performance of a statistical model derived from a corpus with that of native speakers. The specific phenomenon that we investigate concerns six Russian verbs (probovat’, silit’sja, pytat’sja, norovit’, starat’sja, poryvat'sja) which are similar in meaning – they can all be translated with the English verb try – but which are not fully synonymous. As explained in Divjak (2010: 1-14), these verbs were selected as near-synonyms on the basis of a distributional analysis in the tradition of Harris (1954) and Firth (1957), with meaning construed as contextual in the Wittgensteinian sense. Synonymy was thus operationalized as mutual substitutability or interchangeability within a set of constructions, forming a shared constructional network. This is motivated by a Construction Grammar approach to language in which both constructions and lexemes are considered to have meaning; as a consequence, the lexeme’s meaning has to be compatible with the meaning of the construction in which it occurs and of the constructional slot it occupies to yield a felicitous combination. Therefore, the range of constructions a given verb is used in and the meaning of each of those constructions are revealing of the coarse-grained meaning contours of that verb. The results can then be used to delineate groups of near-synonymous verbs. On this approach, near-synonyms share constructional properties, even though the extent to which a construction is typical for a given verb may vary and the individual lexemes differ as to how they are used within the shared constructional frames.

To study verbal behavior within a shared constructional frame we build on earlier work by Divjak (2003, 2004, 2010), who constructed a database containing 1351 tokens of these verbs. The data were sourced from the Amsterdam Corpus, supplemented with data from the Russian National Corpus, which contains written literary texts. About 250 extractions per verb were analysed in detail, except for poryvat’sja, which is rare and for which only half that number of examples could be found.
Samples of equal size were chosen for two reasons: (1) our interest was in the contextual properties that would favour the choice of one verb over another, and by fixing the sample size, frequency was controlled; (2) the difference in frequency of occurrence between these verbs is so large (see Table 6 below) that manually annotating a sample in which the verbs were represented proportionally would be prohibitively expensive.

The sentences containing one of the six TRY verbs were manually annotated for a variety of morphological, semantic and syntactic properties, using the annotation scheme proposed in Divjak (2003, 2004). The tagging scheme was built up incrementally and bottom-up, starting from the grammatical and lexical-conceptual elements that were attested in the data. This scheme captures virtually all information provided at the clause level (in the case of complex sentences) or sentence level (for simplex sentences) by tagging morphological properties of the finite verb and the infinitive, syntactic properties of the sentences, and semantic properties of the subject and infinitive as well as of the optional elements. There were a total of 14 multiple-category variables amounting to 87 distinct variable categories or contextual properties.

Divjak and Arppe (2013) used this dataset to train a polytomous logistic regression model (Arppe 2013a, 2013b) predicting the choice of verb. As a rule of thumb, the number of distinct variable categories that allows for reliable fitting of a (polytomous) logistic regression model should not exceed 1/10 of the frequency of the least frequent outcome (Arppe 2008: 116). In this case, the least frequent verb occurs about 150 times, hence the number of variable categories should be approximately 15. The selection strategy we adopted (out of many possible ones) was to retain variables with a broad dispersion among the six TRY verbs. This ensured a focus on the interaction of variables in determining the expected probability in context, rather than allowing individual distinctive variables, linked to only one of the verbs, to alone determine the choice. As selection criteria we required the overall frequency of the variable in the data to be at least 45, and the variable to occur at least twice (i.e. not just a single chance occurrence) with all six TRY verbs. Additional technical restrictions excluded one variable from each pair of fully mutually complementary variables (e.g. the aspect of the verb form – if a verb form is imperfective it cannot at the same time be perfective, and vice versa), as well as variables with a mutual pair-wise uncertainty coefficient (UC, a measure of nominal category association; Theil 1970) larger than 0.5 (i.e. one variable reduces more than half of the uncertainty concerning the other).
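Theil's uncertainty coefficient for two categorical variables X and Y is UC(X|Y) = (H(X) − H(X|Y)) / H(X), the proportion of the entropy of X that knowing Y removes. The sketch below (ours, with invented annotations) shows the screening computation; a pair scoring above 0.5 would lose one of its members.

```python
# Illustrative sketch (not the authors' code): Theil's uncertainty coefficient
# UC(X|Y) = (H(X) - H(X|Y)) / H(X) for two categorical annotation variables,
# as used to screen out near-redundant predictors (pairs with UC > 0.5).
from collections import Counter
from math import log2

def entropy(xs):
    n = len(xs)
    return -sum((c / n) * log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    n = len(xs)
    return sum((c / n) * entropy([x for x, y in zip(xs, ys) if y == level])
               for level, c in Counter(ys).items())

def uc(xs, ys):
    hx = entropy(xs)
    return 0.0 if hx == 0 else (hx - conditional_entropy(xs, ys)) / hx

# Two invented binary annotations over eight corpus examples.
declarative = [1, 1, 0, 0, 1, 0, 1, 0]
main_clause = [1, 1, 0, 0, 1, 0, 0, 1]
print(f"UC = {uc(declarative, main_clause):.2f}")  # > 0.5 would drop one variable
```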
Altogether 18 variable categories were retained (11 semantic and 7 structural), belonging to 7 different types. These are listed in Table 1.

Structural properties:
1. declarative sentence
2. try verb in main clause
3. try verb in perfective aspect
4. try verb in indicative mood
5. try verb in gerund
6. try verb in past tense
7. subordinate verb in imperfective aspect

Semantic properties:
8. human agent
9. subordinate verb involves high control
10. subordinate verb designates an act of communication
11. subordinate verb designates an act of exchange
12. subordinate verb designates a physical action involving self
13. subordinate verb designates a physical action involving another participant
14. subordinate verb designates motion involving self
15. subordinate verb designates motion involving another participant
16. subordinate verb designates metaphorical motion
17. subordinate verb designates metaphorical exchange
18. subordinate verb designates metaphorical action involving other

Table 1. Predictors used by the Divjak and Arppe (2013) model

Using the values of these variables, as calculated from the data in the sample, the model predicts the probability of each verb in each sentence. More interestingly from an analyst's perspective, the model tells us how strongly each feature individually is associated with each verb (e.g. norovit' and especially poryvat'sja are strongly preferred when the infinitive describes a motion event, while pytat'sja, starat’sja and silit'sja are dispreferred in this context; probovat' does not have a preference one way or the other). This enables us to characterize each verb’s preferences (Divjak 2010, Arppe & Divjak 2013, Arppe 2013b). Assuming that the model “chooses” the verb with the highest predicted probability (though strictly speaking a logistic regression model attempts to represent the proportions of the possible alternative choices in the long run), its overall accuracy was 51.7% (50.3% when tested on unseen data). This is well above chance: since there are six verbs, chance performance would be 16.7%. This overall accuracy may, however, still seem disappointingly low until we remember that the verbs have very similar meanings and are often interchangeable: that is to say, most contexts allow several, if not all, of the verbs. So the more interesting question is how the model's performance compares with that of humans. We explore this question in three studies.

3. STUDY 1 – FORCED CHOICE TASK

In this study, we investigate Russian speakers' preferences for verbs of trying in specific sentential contexts using a forced-choice task. We then compare the speakers' preferences to those of the model, assuming that the model “prefers” the verb with the highest predicted probability. Obviously, choosing a verb to go in a particular sentence is a fairly artificial task: it is not what speakers do during normal language use. However, a forced-choice task provides useful information about speakers' preferences, and for this reason such tasks are routinely used in psycholinguistic research as well as in language testing. From our point of view, its major advantage is that it allows us to obtain comparable data from the model and from native speakers.
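A hypothetical sketch (ours, with invented numbers) of the comparison at stake: for each test sentence, take the verb to which the model assigns the highest probability and set it against the verb most often selected by the participants.

```python
# Hypothetical sketch of the Study 1 comparison: the model's highest-probability
# verb per sentence vs. the participants' modal forced choice. All numbers and
# responses below are invented.
from collections import Counter

model_probs = [  # invented probability distributions over the six TRY verbs
    {"probovat'": 0.41, "pytat'sja": 0.30, "starat'sja": 0.12,
     "silit'sja": 0.08, "norovit'": 0.05, "poryvat'sja": 0.04},
    {"probovat'": 0.18, "pytat'sja": 0.22, "starat'sja": 0.20,
     "silit'sja": 0.16, "norovit'": 0.13, "poryvat'sja": 0.11},
]
human_choices = [  # invented forced-choice responses, one list per sentence
    ["probovat'", "probovat'", "pytat'sja", "probovat'"],
    ["starat'sja", "pytat'sja", "norovit'", "starat'sja"],
]

agreement = 0
for probs, choices in zip(model_probs, human_choices):
    model_pick = max(probs, key=probs.get)               # model's "preferred" verb
    human_pick = Counter(choices).most_common(1)[0][0]   # participants' modal choice
    agreement += model_pick == human_pick

print(f"model and modal human choice agree on {agreement}/{len(model_probs)} sentences")
```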